Abstract

We present a general framework based on weighted finite automata 
and weighted finite-state transducers for describing and implementing 
speech recognizers. The framework allows us to represent uniformly the 
information sources and data structures used in recognition, including 
context-dependent units, pronunciation dictionaries, language models 
and lattices. Furthermore, general but efficient algorithms can be used
for combining information sources in actual recognizers and for opti- 
mizing their application. In particular, a single composition algorithm 
is used both to combine in advance information sources such as lan- 
guage models and dictionaries, and to combine acoustic observations 
and information sources dynamically during recognition. 

1 Introduction 



Many problems in speech processing can be usefully analyzed in terms of 
the "noisy channel" metaphor: given an observation sequence o, find which 
intended message w is most likely to generate that observation sequence by 
maximizing 

$$P(w, o) = P(o \mid w)\, P(w),$$

where $P(o \mid w)$ characterizes the transduction between intended messages and
observations, and $P(w)$ characterizes the message generator. More generally,
the transduction between messages and observations may involve several
stages relating successive levels of representation:

$$
\begin{aligned}
P(s_0, s_k) &= P(s_k \mid s_0)\, P(s_0) \\
P(s_k \mid s_0) &= \sum_{s_1, \ldots, s_{k-1}} P(s_k \mid s_{k-1}) \cdots P(s_1 \mid s_0)
\end{aligned}
\tag{1}
$$

Each $s_j$ is a sequence of units of an appropriate representation, for instance
phones or syllables in speech recognition. A straightforward but useful
observation is that any such cascade can be factored at any intermediate
level

$$P(s_j \mid s_i) = \sum_{s_l} P(s_j \mid s_l)\, P(s_l \mid s_i) \tag{2}$$

For computational reasons, sums and products in (1) are often replaced
by minimizations and sums of negative log probabilities, yielding the
approximation

$$
\begin{aligned}
\tilde{P}(s_0, s_k) &= \tilde{P}(s_k \mid s_0) + \tilde{P}(s_0) \\
\tilde{P}(s_k \mid s_0) &\approx \min_{s_1, \ldots, s_{k-1}} \sum_{1 \le j \le k} \tilde{P}(s_j \mid s_{j-1})
\end{aligned}
\tag{3}
$$

where $\tilde{X} = -\log X$. In this formulation, assuming the approximation is
reasonable, the most likely message $s_0$ is the one minimizing $\tilde{P}(s_0, s_k)$.
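
As a small numerical illustration of the difference between the exact sum over intermediate levels and the min-sum approximation, the following Python sketch uses a made-up two-stage cascade. The tables `p_s1_given_s0` and `p_s2_given_s1`, the symbols in them and all probabilities are assumptions chosen only for this example.

```python
import math

# Hypothetical two-stage cascade s0 -> s1 -> s2; all numbers are made up.
# p_s1_given_s0[s0][s1] = P(s1 | s0), p_s2_given_s1[s1][s2] = P(s2 | s1).
p_s1_given_s0 = {"w": {"a": 0.6, "b": 0.4}}
p_s2_given_s1 = {"a": {"o": 0.5}, "b": {"o": 0.25}}


def exact_prob(s0, s2):
    """P(s2 | s0): sum over every intermediate sequence s1."""
    return sum(p * p_s2_given_s1[s1].get(s2, 0.0)
               for s1, p in p_s1_given_s0[s0].items())


def viterbi_cost(s0, s2):
    """Approximate -log P(s2 | s0) by the cheapest single intermediate s1."""
    costs = [-math.log(p) - math.log(p_s2_given_s1[s1][s2])
             for s1, p in p_s1_given_s0[s0].items()
             if p_s2_given_s1[s1].get(s2, 0.0) > 0.0]
    return min(costs) if costs else math.inf


total = exact_prob("w", "o")               # 0.6 * 0.5 + 0.4 * 0.25
best = math.exp(-viterbi_cost("w", "o"))   # best single path: 0.6 * 0.5
print(total, best)
```

In this toy case the approximation keeps only the dominant path, so it underestimates the total probability while still identifying the most likely intermediate sequence.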

In current speech recognition systems, a transduction stage is typically 
modeled by a finite-state device, for example a hidden Markov model (HMM). 
However, the commonalities among stages are typically not exploited, and 
each stage is represented and implemented by "ad hoc" means. The goal 
of this paper is to show that the theory of weighted rational languages and 
transductions can be used as a general framework for transduction cascades. 
Levels of representation will be modeled as weighted languages, and trans- 
duction stages will be modeled as weighted transductions. 

This foundation provides a rich set of operators for combining cascade 
levels and stages that generalizes the standard operations on regular lan- 
guages, suggests novel ways of combining models of different parts of the de- 
coding process, and supports uniform algorithms for transduction and search 
throughout the cascade. Computationally, stages and levels of representa- 
tion are represented as weighted finite automata, and a general automata 
composition algorithm implements the relational composition of successive 
stages. Automata compositions can be searched with standard best-path 
algorithms to find the most likely transcriptions of spoken utterances. A 
"lazy" implementation of composition allows search and pruning to be car- 
ried out concurrently with composition so that only the useful portions of 
the composition of the observations with the decoding cascade are explicitly
created. Finally, finite-state minimization techniques can be used to reduce
the size of cascade levels and thus improve recognition efficiency [12]. 

Weighted languages and transductions are generalizations of the stan- 
dard notions of language and transduction in formal language theory [^, ^] . 
A weighted language is a mapping from strings over an alphabet to weights, 
while a weighted transduction is a mapping from pairs of strings over two 
alphabets to weights. For example, when weights represent probabilities and 
assuming appropriate normalization, a weighted language is just a proba- 
bility distribution over strings, and a weighted transduction a conditional 
probability distribution between strings. The weighted rational languages 
and transducers are those that can be represented by weighted finite-state 
acceptors (WFSAs) and weighted finite-state transducers (WFSTs), as de- 
scribed in more detail in the next section. In this paper we will be concerned 
with the weighted rational case, although some of the theory can be
profitably extended to more general language classes closed under intersection with
regular languages and composition with rational transductions [9, 22].

The notion of weighted rational transduction arises from the combi- 
nation of two ideas in automata theory: rational transductions, used in 
many aspects of formal language theory H, and weighted languages and
automata, developed in pattern recognition [||, 15] and algebraic automata
theory |3|, |B|, ||. Ordinary (unweighted) rational transductions have been 
successfully applied by researchers at Xerox PARC and at the University
of Paris 7 |1^, 14, 19, p0| , among others, to several problems in language
processing, including morphological analysis, dictionary compression and
syntactic analysis. HMMs and probabilistic finite-state language models can be
shown to be equivalent to WFSAs. In algebraic automata theory, rational 
series and rational transductions Q are the algebraic counterparts of WF- 
SAs and WFSTs and give the correct generalizations to the weighted case 
of the standard algebraic operations on formal languages and transductions, 
such as union, concatenation, intersection, restriction and composition. We 
believe our work is the first application of these generalizations to speech 
processing. 

While we concentrate here on speech recognition applications, the same 
framework and tools have also been applied to other language processing 
tasks such as the segmentation of Chinese text into words [21]. We explain 
how a standard HMM-based recognizer can be naturally viewed as equivalent 
to a cascade of weighted transductions, and how the approach requires no 
modification to accommodate context dependencies that cross higher-level 
unit boundaries, for instance cross-word context-dependent models. This is 
an important advantage of the transduction approach over the usual, but
more limited, "substitution" approach used in existing speech recognizers.
Substitution replaces a symbol at a higher level by its defining language at 
a lower level, but, as we will argue, cannot model directly the interactions 
between context-dependent units at the lower level. 

2 Theory 

2.1 The Weight Semiring 

As discussed informally in the previous section, our approach relies on asso- 
ciating weights to the strings in a language, the string pairs in a transduc- 
tion and the transitions in an automaton. The operations used for weight 
combination should reflect the intended interpretation of the weights. For 
instance, if the weights of automata transitions represent transition proba- 
bilities, the weight assigned to a path should be the product of the weights of 
its transitions, while the weight (total probability) assigned to a set of paths 
with common source and destination should be the sum of the weights of the 
paths in the set. However, if the weights represent negative log-probabilities 
and we are operating under the Viterbi approximation that replaces the sum 
of the probabilities of alternative paths by the probability of the most prob- 
able path, path weights should be the sum of the weights of the transitions 
in the path and the weight assigned to a set of paths should be the minimum 
of the weights of the paths in the set. Both of these weight structures are 
special cases of commutative semirings, which are the basis of the general 
theory of weighted languages, transductions and automata ||, [51 B . 

In general, a semiring is a set $K$ with two binary operations, collection
$+_K$ and extension $\times_K$, such that:

• collection is associative and commutative with identity $0_K$;

• extension is associative with identity $1_K$;

• extension distributes over collection;

• $a \times_K 0_K = 0_K \times_K a = 0_K$ for any $a \in K$.

The semiring is commutative if extension is commutative. 

Setting $K = \mathbb{R}_+$ with $+$ for collection, $\times$ for extension, 0 for $0_K$ and
1 for $1_K$ we obtain the sum-times semiring, which we can use to model
probability calculations. Setting $K = \mathbb{R}_+ \cup \{\infty\}$ with $\min$ for collection,
$+$ for extension, $\infty$ for $0_K$ and 0 for $1_K$ we obtain the min-sum semiring,
which models negative log-probabilities under the Viterbi approximation.
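
The two semirings can be written down directly. The sketch below is only an illustration: the `Semiring` class, the helper names `extend` and `collect`, and the example path weights are assumptions introduced here, not part of the implementation described in this paper.

```python
import math
from dataclasses import dataclass
from functools import reduce
from typing import Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[float, float], float]    # collection
    times: Callable[[float, float], float]   # extension
    zero: float                              # identity of collection
    one: float                               # identity of extension


SUM_TIMES = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
MIN_SUM = Semiring(min, lambda a, b: a + b, math.inf, 0.0)


def extend(sr, weights):
    """Weight of one path: extension of its transition weights."""
    return reduce(sr.times, weights, sr.one)


def collect(sr, path_weights):
    """Weight of a set of alternative paths: collection of the path weights."""
    return reduce(sr.plus, path_weights, sr.zero)


# Two alternative paths with made-up transition probabilities.
paths = [[0.5, 0.4], [0.1, 0.9]]
total_prob = collect(SUM_TIMES, [extend(SUM_TIMES, p) for p in paths])

# The same paths as negative log-probabilities in the min-sum semiring.
neg_log = [[-math.log(x) for x in p] for p in paths]
best_cost = collect(MIN_SUM, [extend(MIN_SUM, p) for p in neg_log])
print(total_prob, math.exp(-best_cost))
```

Keeping the operations in one object lets the same path and path-set computations run unchanged under either interpretation of the weights.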

In general, weights represent some measure of "goodness" that we want 
to optimize. For instance, with probabilities we are interested in the highest 
weight, while the lowest weight is sought for negative log-probabilities. We 
thus assume a total order on weights and write $\max_x f(x)$ for the optimal
value of the weight-valued function $f$ and $\operatorname{argmax}_x f(x)$ for some $x$ that
optimizes $f(x)$. We also assume that extension and collection are monotonic
with respect to the total order. 

In what follows, we will assume a fixed semiring K and thus drop the 
subscript K in the symbols for its operations and identity elements. Unless 
stated otherwise, all the discussion will apply to any commutative semir- 
ing, if necessary with a total order for optimization. Some definitions and 
calculations involve collecting over potentially infinite sets, for instance the 
set of strings of a language. Clearly, collecting over an infinite set is al- 
ways well-defined for idempotent semirings such as the min-sum semiring, 
in which $a + a = a$ for all $a \in K$. More generally, a closed semiring is one in
which collecting over infinite sets is well defined. Finally, some particular 
cases arising in the discussion below can be shown to be well defined for the 
sum-times semiring under certain mild conditions on the weights assigned
to strings or automata transitions [Q, [g. 

2.2 Weighted Transductions and Languages 

In the transduction cascade each stage corresponds to a mapping from
input-output pairs $(r, s)$ to probabilities $P(s \mid r)$. More formally, stages in the
cascade will be weighted transductions $T : \Sigma^* \times \Gamma^* \to K$ where $\Sigma^*$ and $\Gamma^*$ are
the sets of strings over the alphabets $\Sigma$ and $\Gamma$, and $K$ is the weight semiring.
We will denote by $T^{-1}$ the inverse of $T$ defined by $T^{-1}(t, s) = T(s, t)$.

The right-most step of (1) is not a transduction, but rather an
information source, the language model. We will represent such sources as weighted
languages $L : \Sigma^* \to K$.

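Each transduction $S : \Sigma^* \times \Gamma^* \to K$ has two associated weighted
languages, its first and second projections $\pi_1(S) : \Sigma^* \to K$ and
$\pi_2(S) : \Gamma^* \to K$, defined by

$$
\begin{aligned}
\pi_1(S)(s) &= \sum_{t \in \Gamma^*} S(s, t) \\
\pi_2(S)(t) &= \sum_{s \in \Sigma^*} S(s, t)
\end{aligned}
$$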

Given two transductions $S : \Sigma^* \times \Gamma^* \to K$ and $T : \Gamma^* \times \Delta^* \to K$, we
define their composition $S \circ T$ by

$$(S \circ T)(r, t) = \sum_{s \in \Gamma^*} S(r, s) \times T(s, t) \tag{4}$$

For example, if $S$ represents $P(s_l \mid s_i)$ and $T$ represents $P(s_j \mid s_l)$ in (2),
$S \circ T$ represents $P(s_j \mid s_i)$.
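
For transductions with finite support, the composition just defined can be computed by joining on the shared middle string. The following is a minimal sketch under that assumption; the dictionary representation, the function name `compose` and the example weights are hypothetical and serve only to illustrate the definition.

```python
from collections import defaultdict


def compose(S, T, plus=lambda a, b: a + b, times=lambda a, b: a * b):
    """(S o T)(r, t): collect S(r, s) x T(s, t) over every middle string s.

    S and T map string pairs to weights and are assumed to have finite
    support; pairs that are absent implicitly carry the zero weight.
    """
    by_input = defaultdict(list)
    for (s, t), w in T.items():
        by_input[s].append((t, w))
    result = {}
    for (r, s), w1 in S.items():
        for t, w2 in by_input.get(s, []):
            w = times(w1, w2)
            key = (r, t)
            result[key] = plus(result[key], w) if key in result else w
    return result


# Made-up conditional probabilities: S(r, s) = P(s | r), T(s, t) = P(t | s).
S = {("x", "a"): 0.6, ("x", "b"): 0.4}
T = {("a", "o"): 0.5, ("b", "o"): 0.25, ("b", "n"): 0.75}
ST = compose(S, T)
print(ST)
```

The default arguments correspond to the sum-times semiring; passing `plus=min` and an additive `times` gives the min-sum version of the same operation.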

A weighted transduction $S : \Sigma^* \times \Gamma^* \to K$ can also be applied to a
weighted language $L : \Sigma^* \to K$ to yield a weighted language $S[L]$ over $\Gamma$:

$$S[L](s) = \sum_{r \in \Sigma^*} L(r) \times S(r, s) \tag{5}$$

We can also identify any weighted language L with the identity trans- 
duction restricted to L: 

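$$
L(r, r') =
\begin{cases}
L(r) & \text{if } r = r' \\
0 & \text{otherwise}
\end{cases}
$$

Using this identification, application is transduction composition followed
by projection:

$$
\begin{aligned}
\pi_2(L \circ S)(s) &= \sum_{r \in \Sigma^*} \sum_{r' \in \Sigma^*} L(r, r') \times S(r', s) \\
&= \sum_{r \in \Sigma^*} L(r) \times S(r, s) \\
&= S[L](s)
\end{aligned}
$$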

From now on, we will take advantage of the identification of languages 
with transductions and use $\circ$ to express both composition and application,
often leaving implicit the projections required to extract languages from
transductions. In particular, the intersection of two weighted languages
$M, N : \Sigma^* \to K$ is given by

$$\pi_1(M \circ N)(s) = \pi_2(M \circ N)(s) = M(s) \times N(s) \tag{6}$$

It is easy to see that composition is associative, that is, the result of any
transduction cascade $R_1 \circ \cdots \circ R_m$ is independent of the order of application of
the composition operators.

For a more concrete example, consider the transduction cascade for
speech recognition depicted in Figure 1, where $A$ is the transduction from
acoustic observation sequences to phone sequences, $D$ the transduction from
phone sequences to word sequences (essentially a pronunciation dictionary)
and $M$ a weighted language representing the language model. Given a
particular sequence of observations $o$, we can represent it as the trivial weighted
language $O$ that assigns 1 to $o$ and 0 to any other sequence. Then $O \circ A$
represents the acoustic likelihoods of possible phone sequences that generate $o$,
$O \circ A \circ D$ the acoustic-lexical likelihoods of possible word sequences yielding
$o$, and $O \circ A \circ D \circ M$ the combined acoustic-lexical-linguistic probabilities
of word sequences generating $o$. The word string $w$ with the highest weight
in $\pi_2(O \circ A \circ D \circ M)$ is the most likely sentence hypothesis generating $o$.

Figure 1: Recognition Cascade

Operation       Transduction
singleton       $\{(u, v)\}(w, z) = 1$ iff $u = w$ and $v = z$
scaling         $(kT)(u, v) = k \times T(u, v)$
sum             $(S + T)(u, v) = S(u, v) + T(u, v)$
concatenation   $(ST)(t, w) = \sum_{rs = t,\, uv = w} S(r, u) \times T(s, v)$
power           $T^0(\epsilon, \epsilon) = 1$
                $T^0(u \ne \epsilon, v \ne \epsilon) = 0$
                $T^{n+1} = T\,T^n$
closure         $T^* = \sum_{k \ge 0} T^k$

Table 1: Rational Operations

Composition is thus the main operation involved in the construction and 
use of transduction cascades. As we will see in the next section, composi- 
tion can be implemented as a suitable generalization of the usual intersection 
algorithm for finite automata. In addition to composition, weighted trans- 
ductions (and languages, given the identification of languages with trans- 
ductions presented earlier) can be constructed from simpler ones using the 
operations shown in Table 1, which generalize in a straightforward way the
regular operations well-known from traditional automata theory ||. In fact, 
the rational languages and transductions are exactly those that can be built 
from singletons by applications of scaling, sum, concatenation and closure. 

For example, assume that for each word $w$ in a lexicon we are given
a rational transduction $D_w$ such that $D_w(p, w)$ is the probability that $w$
is realized as the phone sequence $p$. Note that this allows for multiple
pronunciations for $w$. Then the rational transduction $(\sum_w D_w)^*$ gives the
probabilities for realizations of word sequences as phone sequences if we leave
aside cross-word context dependencies, which will be discussed in Section 3.

2.3 Weighted Automata 

Kleene's theorem states that regular languages are exactly those repre- 
sentable by finite-state acceptors || . Generalized to the weighted case and to 
transductions, it states that weighted rational languages and transductions 
are exactly those that can be represented by weighted finite automata pL ||] . 
Furthermore, all the operations on languages and transductions we have dis- 
cussed have finite-automata counterparts, which we have implemented. Any 
cascade representable in terms of those operations can thus be implemented 
directly as an appropriate combination of the programs implementing each 
of the operations. 

A $K$-weighted finite automaton $A$ is given by a finite set of states $Q_A$,
a set of transition labels $\Lambda_A$, an initial state $i_A$, a final weight function
$F_A : Q_A \to K$ (footnote 1), and a finite set $\delta_A \subseteq Q_A \times \Lambda_A \times K \times Q_A$ of transitions
$t = (t.\mathrm{src}, t.\mathrm{lab}, t.w, t.\mathrm{dst})$. The label set $\Lambda_A$ must have an associative
concatenation operation $u \cdot v$ with identity element $\epsilon_A$. A weighted
finite-state acceptor (WFSA) is a $K$-weighted finite automaton with $\Lambda_A = \Sigma^*$
for some finite alphabet $\Sigma$. A weighted finite-state transducer (WFST) is
a $K$-weighted finite automaton such that $\Lambda_A = \Sigma^* \times \Gamma^*$ for given finite
alphabets $\Sigma$ and $\Gamma$, its label concatenation is defined by
$(r, s) \cdot (u, v) = (ru, sv)$, and its identity (null) label is $(\epsilon, \epsilon)$. For
$l = (r, s) \in \Sigma^* \times \Gamma^*$ we define $l.\mathrm{in} = r$ and $l.\mathrm{out} = s$. As we have done for languages, we will often
identify a weighted acceptor with the transducer with the same state set and
a transition $(q, (x, x), k, q')$ for each transition $(q, x, k, q')$ in the acceptor.

A path in an automaton $A$ is a sequence of transitions $p = t_1, \ldots, t_m$
in $\delta_A$ with $t_i.\mathrm{src} = t_{i-1}.\mathrm{dst}$ for $1 < i \le m$ (footnote 2). We define the source and the
destination of $p$ by $p.\mathrm{src} = t_1.\mathrm{src}$ and $p.\mathrm{dst} = t_m.\mathrm{dst}$, respectively. The label
of $p$ is the concatenation $p.\mathrm{lab} = t_1.\mathrm{lab} \cdots t_m.\mathrm{lab}$, its weight is the product
$p.w = t_1.w \times \cdots \times t_m.w$ and its acceptance weight is $F(p) = p.w \times F_A(p.\mathrm{dst})$.
We denote by $P_A(q, q')$ the set of all paths in $A$ with source $q$ and destination
$q'$, by $P_A(q)$ the set of all paths in $A$ with source $q$, by $P_A^u(q, q')$ the subset
of $P_A(q, q')$ with label $u$ and by $P_A^u(q)$ the subset of $P_A(q)$ with label $u$.

Footnote 1: The usual notion of final state can be represented by $F_A(q) = 1$ if $q$ is final, $F_A(q) = 0$
otherwise. More generally, we call a state final if its weight is not 0. Also, we will interpret
any non-weighted automaton as a weighted automaton in which all transitions and final
states have weight 1.

Footnote 2: For convenience, for each state $q \in Q_A$ we also have an empty path with no transitions
and source and destination $q$.

Each state $q \in Q_A$ defines a weighted transduction (or a weighted
language):

$$L_A(q)(u) = \sum_{p \in P_A^u(q)} F(p) \tag{7}$$

Finally, we can define the weighted transduction (language) of a weighted
transducer (acceptor) $A$ by

$$[\![A]\!] = L_A(i_A) \tag{8}$$

The appropriate generalization of Kleene's theorem to weighted acceptors 
and transducers states that under suitable conditions guaranteeing that the 
inner sum in (7) is defined, weighted rational languages and transductions
are exactly those defined by weighted automata as outlined here J|]. 
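
To make the definition of the language of an automaton concrete, the sketch below computes the acceptance weight of a string for a small acceptor. It assumes an ε-free acceptor whose transitions carry single symbols, so that a forward pass over the string collects the weights of all matching paths. The class `WFSA`, the function `language_weight` and the example automaton are hypothetical.

```python
import math
from collections import defaultdict


class WFSA:
    """Epsilon-free weighted acceptor with single-symbol transition labels."""

    def __init__(self, initial, finals, transitions):
        self.initial = initial
        self.finals = finals                 # state -> final weight
        self.out = defaultdict(list)         # src -> [(label, weight, dst)]
        for src, lab, w, dst in transitions:
            self.out[src].append((lab, w, dst))


def language_weight(a, u, plus, times, zero, one):
    """Collect the acceptance weights F(p) of all paths labeled u from the initial state."""
    forward = {a.initial: one}               # weight of the empty path
    for symbol in u:
        nxt = {}
        for q, w in forward.items():
            for lab, tw, dst in a.out.get(q, []):
                if lab == symbol:
                    ext = times(w, tw)
                    nxt[dst] = plus(nxt[dst], ext) if dst in nxt else ext
        forward = nxt
    total = zero
    for q, w in forward.items():
        if q in a.finals:
            total = plus(total, times(w, a.finals[q]))
    return total


# Two paths accept "ab"; the probabilities are made up.
probs = WFSA(0, {3: 1.0}, [(0, "a", 0.6, 1), (0, "a", 0.4, 2),
                           (1, "b", 0.5, 3), (2, "b", 1.0, 3)])
p_ab = language_weight(probs, "ab", lambda x, y: x + y,
                       lambda x, y: x * y, 0.0, 1.0)

# The same topology with costs, evaluated in the min-sum semiring.
costs = WFSA(0, {3: 0.0}, [(0, "a", 1.0, 1), (0, "a", 2.0, 2),
                           (1, "b", 3.0, 3), (2, "b", 0.5, 3)])
c_ab = language_weight(costs, "ab", min, lambda x, y: x + y, math.inf, 0.0)
print(p_ab, c_ab)
```

Strings with no accepting path receive the zero of the semiring, that is, 0 for probabilities and infinity for costs.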

Weighted acceptors and transducers are thus faithful implementations 
of rational languages and transductions, and all the operations on these 
described above have corresponding implementations in terms of algorithms 
on automata. In particular, composition is implemented by the automata 
operation we now describe. 

2.4 Automata Composition 

Informally, the composition of two automata A and B is a generalization of 
NFA intersection. Each state in the composition is a pair of a state of A 
and a state of B, and each path in the composition corresponds to a pair 
of a path in A and a path in B with compatible labels. The total weight of 
the composition path is the extension of the weights of the corresponding 
paths in A and B. The composition operation thus formalizes the notion of 
coordinated search in two graphs, where the coordination corresponds to a 
suitable agreement between path labels. 

The more formal discussion that follows will be presented in terms of 
transducers, taking advantage of the identifications of languages with
transductions and of acceptors with transducers given earlier.

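Consider two transducers $A$ and $B$ with $\Lambda_A = \Sigma^* \times \Gamma^*$ and $\Lambda_B = \Gamma^* \times \Delta^*$.
Their composition $A \bowtie B$ will be a transducer with $\Lambda_{A \bowtie B} = \Sigma^* \times \Delta^*$ such
that:

$$[\![A \bowtie B]\!] = [\![A]\!] \circ [\![B]\!] \tag{9}$$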
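By definition of $L_\cdot(\cdot)$ and $\circ$ we have for any $q \in Q_A$ and $q' \in Q_B$:

$$
\begin{aligned}
(L_A(q) \circ L_B(q'))(u, w)
&= \sum_{v \in \Gamma^*} \Bigl( \sum_{p \in P_A^{(u,v)}(q)} F(p) \Bigr) \times \Bigl( \sum_{p' \in P_B^{(v,w)}(q')} F(p') \Bigr) \\
&= \sum_{v \in \Gamma^*} \sum_{p \in P_A^{(u,v)}(q)} \sum_{p' \in P_B^{(v,w)}(q')} F(p) \times F(p') \\
&= \sum_{(p, p') \in J(q, q', u, w)} F(p) \times F(p')
\end{aligned}
\tag{10}
$$

where $J(q, q', u, w)$ is the set of pairs $(p, p')$ of paths $p \in P_A(q)$ and
$p' \in P_B(q')$ such that $p.\mathrm{lab}.\mathrm{in} = u$, $p.\mathrm{lab}.\mathrm{out} = p'.\mathrm{lab}.\mathrm{in}$ and $p'.\mathrm{lab}.\mathrm{out} = w$. In
particular, we have:

$$([\![A]\!] \circ [\![B]\!])(u, w) = \sum_{(p, p') \in J(i_A, i_B, u, w)} F(p) \times F(p') \tag{11}$$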

Therefore, assuming that (9) is satisfied, this equation collects the weights
of all paths $p$ in $A$ and $p'$ in $B$ such that $p$ maps $u$ to some string $v$ and $p'$
maps $v$ to $w$. In particular, on the min-sum weight semiring, the shortest
path labeled $(u, w)$ in $A \bowtie B$ minimizes the sum of the costs of paths
labeled $(u, v)$ in $A$ and $(v, w)$ in $B$, for some $v$.

We will give first the construction of $A \bowtie B$ for $\epsilon$-free transducers
$A$ and $B$, that is, those with transition labels in $\Sigma \times \Gamma$ and $\Gamma \times \Delta$,
respectively. Then $A \bowtie B$ has state set $Q_{A \bowtie B} = Q_A \times Q_B$, initial state
$i_{A \bowtie B} = (i_A, i_B)$ and final weights $F_{A \bowtie B}(q, q') = F_A(q) \times F_B(q')$. Furthermore,
there is a transition $((q, q'), (x, z), k \times k', (r, r')) \in \delta_{A \bowtie B}$ iff there are
transitions $(q, (x, y), k, r) \in \delta_A$ and $(q', (y, z), k', r') \in \delta_B$. This construction is
similar to the standard intersection construction for DFAs; a proof that it
indeed implements transduction composition (9) is given in Appendix A.
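
The ε-free construction translates almost line by line into code. The sketch below is an illustration only: the dictionary layout for a transducer and the function name `compose_eps_free` are assumptions, and unlike the definition above it builds only the state pairs reachable from the initial pair rather than the full product of the state sets.

```python
from collections import defaultdict


def compose_eps_free(A, B, times=lambda a, b: a * b):
    """Compose two epsilon-free transducers.

    Each transducer is a dict with keys 'initial', 'finals' (state -> weight)
    and 'transitions' (list of (src, (in_sym, out_sym), weight, dst)).
    """
    a_out = defaultdict(list)
    for src, lab, w, dst in A["transitions"]:
        a_out[src].append((lab, w, dst))
    b_out = defaultdict(list)                # keyed by (state, input symbol)
    for src, (y, z), w, dst in B["transitions"]:
        b_out[(src, y)].append((z, w, dst))

    initial = (A["initial"], B["initial"])
    transitions, finals = [], {}
    stack, seen = [initial], {initial}
    while stack:
        q, q2 = stack.pop()
        if q in A["finals"] and q2 in B["finals"]:
            finals[(q, q2)] = times(A["finals"][q], B["finals"][q2])
        for (x, y), k, r in a_out.get(q, []):
            # Pair the transition of A with transitions of B reading A's output y.
            for z, k2, r2 in b_out.get((q2, y), []):
                transitions.append(((q, q2), (x, z), times(k, k2), (r, r2)))
                if (r, r2) not in seen:
                    seen.add((r, r2))
                    stack.append((r, r2))
    return {"initial": initial, "finals": finals, "transitions": transitions}


# Made-up example: A maps a to x; B maps x to u or v.
A = {"initial": 0, "finals": {1: 1.0},
     "transitions": [(0, ("a", "x"), 0.5, 1)]}
B = {"initial": 0, "finals": {1: 1.0},
     "transitions": [(0, ("x", "u"), 0.4, 1), (0, ("x", "v"), 0.6, 1),
                     (0, ("y", "u"), 1.0, 1)]}
AB = compose_eps_free(A, B)
print(AB["transitions"])
```

Restricting the construction to reachable pairs does not change the transduction, because unreachable states lie on no path from the initial state.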

In the general case, we consider transducers $A$ and $B$ with labels over
$\Sigma^? \times \Gamma^?$ and $\Gamma^? \times \Delta^?$, respectively, where $\Lambda^? = \Lambda \cup \{\epsilon\}$ (footnote 3). As shown in (10),
the composition of $A$ and $B$ should have exactly one path for each pair of
paths $p$ in $A$ and $p'$ in $B$ with

$$v = p.\mathrm{lab}.\mathrm{out} = p'.\mathrm{lab}.\mathrm{in} \tag{12}$$

for some string $v \in \Gamma^*$ that we will call the composition string. In the
$\epsilon$-free case, it is clear that $p = t_1, \ldots, t_m$, $p' = t'_1, \ldots, t'_m$ for some $m$ and
$t_i.\mathrm{lab}.\mathrm{out} = t'_i.\mathrm{lab}.\mathrm{in}$. The pairing of $t_i$ with $t'_i$ is precisely what the $\epsilon$-free
composition construction provides. In the general case, however, two paths
$p$ and $p'$ satisfying (12) need not have the same number of transitions.
Furthermore, there may be several ways to align $\epsilon$ outputs in $A$ and $\epsilon$ inputs in
$B$ with staying in the same state in the opposite transducer. This is
exemplified by transducers $A$ and $B$ in Figure 2(a-b), and the corresponding naive
composition in Figure 3. The multiple paths from state (1, 1) to state (3, 2)
correspond to different interleavings between taking the transition from 1
to 2 in $B$ and the transitions from 1 to 2 and from 2 to 3 in $A$. In the
weighted case, including all those paths in the composition would in general
lead to an incorrect total weight for the transduction of string abcd to string
da. Therefore, we need a method for selecting a single composition path for
each pair of compatible paths in the composed transducer.

Footnote 3: It is easy to see that any transducer with transition labels in $\Sigma^* \times \Gamma^*$ is equivalent to
a transducer with labels in $\Sigma^? \times \Gamma^?$.

Figure 2: Transducers with $\epsilon$ Labels (panels a-d)

Figure 3: Composition with Marked $\epsilon$s

Figure 4: Filter Transducer

The following construction, justified in Appendix B, achieves the desired
result. For label $l$, define $\pi_1(l) = l.\mathrm{in}$ and $\pi_2(l) = l.\mathrm{out}$. Given a transducer
$T$, compute $\mathrm{Mark}_i(T)$ from $T$ by replacing the label of every transition $t$ such
that $\pi_i(t.\mathrm{lab}) = \epsilon$ with the new label $l$ defined by $\pi_{3-i}(l) = \pi_{3-i}(t.\mathrm{lab})$ and
$\pi_i(l) = \epsilon_i$, where $\epsilon_i$ is a new symbol. In words, each $\epsilon$ on the $i$th component
of a transition label is replaced by $\epsilon_i$. Corresponding to $\epsilon$ transitions on one
side of the composition we need to stay in the same state on the other side.
Therefore, we define the operation $\mathrm{Skip}_i(T)$ that for each state $q$ of $T$ adds a
new transition $(q, l, 1, q)$ where $\pi_{3-i}(l) = \epsilon_i$ and $\pi_i(l) = \epsilon$. We also need the
auxiliary transducer Filter shown in Figure 4, where the transition labeled
$x : x$ is shorthand for a set of transitions mapping $x$ to itself (at no cost) for
each $x \in \Gamma$. Then for arbitrary transducers $A$ and $B$, we have

$$[\![A]\!] \circ [\![B]\!] = [\![\mathrm{Skip}_1(\mathrm{Mark}_2(A)) \bowtie \mathrm{Filter} \bowtie \mathrm{Skip}_2(\mathrm{Mark}_1(B))]\!]$$

For example, with respect to Figure 2 we have $A' = \mathrm{Skip}_1(\mathrm{Mark}_2(A))$ and
$B' = \mathrm{Skip}_2(\mathrm{Mark}_1(B))$. The thick path in Figure 3 is the only one allowed
by the filter transduction, as desired. In practice, the substitutions and
insertions of $\epsilon_i$ symbols performed by $\mathrm{Mark}_i$ and $\mathrm{Skip}_i$ do not need to be
performed explicitly, because the effects of those operations can be computed
on the fly by a suitable implementation of composition with filtering.

The filter we described is the simplest to explain. In practice, somewhat 
more complex filters, which we will describe elsewhere, help reduce the size 
of the resulting transducer. For example, the filter presented includes in the
composition states (2,1) and (3,1) in Figure 3, from which no final state
can be reached. Such "dead end" paths can be a source of inefficiency when 
using the results of composition. 
Figure 5: Models as Automata



3 Speech Recognition 

We now describe how to represent a speech recognizer as a composition of 
transducers. Recall that we model the recognition task as the composition 
of a language O of acoustic observation sequences, a transduction A from 
acoustic observation sequences to phone sequences, a transduction D from 
phone sequences to word sequences and a weighted language M specifying 
the language model (see Figure 1). Each of these can be represented as a
finite-state automaton (to some approximation), denoted by the same name 
as the corresponding transduction in what follows. 

The acoustic observation automaton O for a given utterance has the
form shown in Figure 5a. Each state represents a fixed point in time $t_i$, and
each transition has a label, $o_i$, drawn from a finite alphabet that quantizes
the acoustic signal between adjacent time points and is assigned probability
1 (footnote 4).

The transducer A from acoustic observation sequences to phone
sequences is built from phone models. A phone model is a transducer from
sequences of acoustic observation labels to a specific phone that assigns to
each acoustic observation sequence the likelihood that the specified phone
produced it. Thus, different paths through a phone model correspond to
different acoustic realizations of the phone. Figure 5b shows a common
topology for phone models. A is then defined as the closure of the sum of
the phone models.

Footnote 4: For more complex acoustic distributions (for instance, continuous densities) we can
instead use multiple transitions $(t_{i-1}, d, p(o_i \mid d), t_i)$ where $d$ is an observation distribution
and $p(o_i \mid d)$ the corresponding observation probability.

The transducer D from phone sequences to word sequences is built
similarly to A. A word model is a transducer from phone sequences to the
specified word that assigns to each phone sequence the likelihood that the
specified word produced it. Thus, different paths through a word model
correspond to different phonetic realizations of the word. Figure 5c shows a
typical topology for a word model. D is then defined as the closure of the 
sum of the word models. 

Finally, the acceptor M encodes the language model, for instance an
n-gram model. Combining those automata, we obtain $\pi_2(O \bowtie A \bowtie D \bowtie M)$,
which assigns a probability to each word sequence. The highest-probability 
path through that automaton estimates the most likely word sequence for 
the given utterance. 
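
Under the min-sum semiring, finding the highest-probability path amounts to a shortest-path search over the composed automaton. The following sketch uses Dijkstra's algorithm, which is one standard best-path method and is not necessarily the one used in our implementation. It assumes nonnegative costs, and the function name `best_path` and the toy word lattice are hypothetical.

```python
import heapq
import itertools


def best_path(initial, finals, transitions):
    """Return (cost, labels) of the cheapest accepting path, or None.

    transitions: list of (src, label, cost, dst) with nonnegative costs;
    finals: state -> final cost (min-sum semiring, lower is better).
    """
    out = {}
    for src, lab, c, dst in transitions:
        out.setdefault(src, []).append((lab, c, dst))
    end = object()                           # super-final sink state
    counter = itertools.count()              # tie-breaker for the heap
    heap = [(0.0, next(counter), initial, ())]
    done = set()
    while heap:
        cost, _, q, labels = heapq.heappop(heap)
        if q is end:
            return cost, list(labels)
        if q in done:
            continue
        done.add(q)
        if q in finals:
            heapq.heappush(heap, (cost + finals[q], next(counter), end, labels))
        for lab, c, dst in out.get(q, []):
            if dst not in done:
                heapq.heappush(heap, (cost + c, next(counter), dst, labels + (lab,)))
    return None


# Made-up word lattice with costs standing for negative log-probabilities.
lattice = [(0, "the", 1.0, 1), (0, "a", 0.5, 1),
           (1, "cat", 2.0, 2), (1, "cap", 1.0, 2)]
result = best_path(0, {2: 0.0}, lattice)
print(result)
```

Because states are expanded only when they are popped from the queue, the same search can in principle be run over an automaton whose transitions are generated on demand.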

The finite-state model of speech recognition that we have just described 
is hardly novel. In fact, it is equivalent to that presented in Q, in the sense 
that it generates the same weighted language. However, the transduction 
cascade approach presented here allows one to view the computations in 
new ways. 

For instance, because composition is associative, the computation of
$\operatorname{argmax}_w \pi_2(O \bowtie A \bowtie D \bowtie M)(w)$ can be organized in a variety of ways.
In a traditional integrated-search recognizer, a single large transducer is
built in advance by $R = A \bowtie D \bowtie M$, and used in recognition to compute
$\operatorname{argmax}_w \pi_2(O \bowtie R)(w)$ for each observation sequence $O$ Q. This approach
is not practical if the size of $R$ exceeds available memory, as is typically the
case for large-vocabulary speech recognition with n-gram language models
for $n > 2$. In those cases, pruning may be interleaved with composition
to compute (an approximation of) $((O \bowtie A) \bowtie D) \bowtie M$. Acoustic
observations are first transduced into a phone lattice represented as an automaton
labeled by phones (phone recognition). The whole lattice is typically too big,
so the computation includes a pruning mechanism that generates only those
states and transitions that appear in high-probability paths. This lattice
is in turn transduced into a word lattice (word recognition), again possibly
with pruning, which is then composed with the language model [|n|, [lTfl . 
The best approach depends on the specific task, which determines the size 
of intermediate results. By having a general package to manipulate weighted 
automata, we have been able to experiment with various alternatives. 

So far, our presentation has used context-independent phone models. In 
other words, the likelihood assigned by a phone model in A is assumed con- 
ditionally independent of neighboring phones. Similarly, the pronunciation 
of each word in D is assumed independent of neighboring words. Therefore,
each of the transducers has a particularly simple form, that of the closure
of the sum of (inverse) substitutions. That is, each symbol in a string on
the output side replaces a language on the input side. This replacement of
a symbol from one alphabet (for example, a word) by the automaton that
represents its substituted language over a finer-grained alphabet (for
example, phones) is the usual stage-combination operation for speech
recognizers.

However, it has been shown that context-dependent phone models, which 
model a phone in the context of its adjacent phones, provide substantial im- 
provements in recognition accuracy [1C]. Further, the pronunciation of a 
word will be affected by its neighboring words, inducing context dependen- 
cies across word boundaries. 

We could include context-dependent models, such as triphone models, 
in our presentation by expanding our 'atomic models' in A to one for every 
phone in a distinct triphonic context. Each model will have the same form 
as in Figure 5b, but it will be over an enlarged output alphabet and have
different likelihoods for the different contexts. We could also try to directly
specify D in terms of the new units, but this is problematic. First, even
if each word in D had only one phonetic realization, we could not directly
substitute the phones in the realization by their context-dependent
models, because the given word may appear in the context of many different
words, with different phones abutting the given word. This problem is
commonly alleviated by either using left (right) context-independent units at
the word starts (ends), which decreases the model accuracy, or by building
a fully context-dependent lexicon and using special machinery in the
recognizer to ensure the correct models are used at word junctures. In either case,
we can no longer use compact lexical entries with multiple pronunciations
such as that of Figure 5c. Those approaches attempt to solve the
context-dependency problem by introducing new substitutions, but substitutions are
not really appropriate for the task.

In contrast, context dependency can be readily represented by a simple
transducer. We leave D as defined before, but interpose a new transducer
C between A and D that converts between context-dependent and
context-independent units; that is, we now compute
$\arg\max_w \pi_2(O \bowtie A \bowtie C \bowtie D \bowtie M)(w)$. A possible
form for C is shown in Figure 6. For simplicity, we show only the portion
of the transducer concerning two hypothetical phones x and y. The
transducer maps each context-dependent model p/l_r, associated to phone p
when preceded by l and followed by r, to an occurrence of p which is
guaranteed to be preceded by l and followed by r. To ensure this, each
state labeled p.q represents the context information that all incoming
transitions correspond to phone p, and all outgoing transitions correspond
to phone q. Thus we can represent context dependency directly as a
transducer, without needing specialized context-dependency code in the
recognizer. More complex forms of context dependency, such as those based
on classification trees over a bounded neighborhood of the target phone,
can also be compiled into appropriate transducers and interposed in the
recognition cascade without changing any aspect of the recognition
algorithm. Transducer determinization and minimization techniques [12] can
be used to make context-dependency transducers as compact as possible.

Figure 6: Context-Dependency Transducer
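
As an illustration of the state-labeling scheme just described, the following
Python sketch enumerates the states and transitions of such a transducer for
a small phone set. It is only a sketch under stated assumptions: the function
name `build_context_transducer` and the tuple layout of an arc are
hypothetical, weights are omitted, and initial states, final states and
word-boundary handling (which the text does not specify) are left out.

```python
from itertools import product


def build_context_transducer(phones):
    """Return (states, arcs) for a triphone context-dependency transducer.

    A state (p, q) plays the role of the state labeled p.q in the text:
    every transition entering it corresponds to phone p and every
    transition leaving it corresponds to phone q.
    Each arc is (source, input_label, output_label, destination).
    This layout is an assumption made for the example.
    """
    states = list(product(phones, repeat=2))
    arcs = []
    for (l, p) in states:
        for r in phones:
            # Leaving state l.p reads the model for phone p in context l_r
            # and emits the plain phone p. The destination p.r commits the
            # next transition to phone r, so p really is followed by r.
            arcs.append(((l, p), f"{p}/{l}_{r}", p, (p, r)))
    return states, arcs


states, arcs = build_context_transducer(["x", "y"])
for src, model, phone, dst in arcs:
    print(f"{src[0]}.{src[1]} --{model}:{phone}--> {dst[0]}.{dst[1]}")
```

With n phones the sketch produces n^2 states and n^3 transitions, one
transition per triphone, which is why compaction of the resulting
transducer matters for realistic phone inventories.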

4 Implementation 

The transducer operations described in this paper, together with a variety of 
support functions, have been implemented in C. Two interfaces are provided: 
a library of functions operating on an abstract finite-state machine datatype, 
and a set of composable shell commands for fast prototyping. The modular 
organization of the library and shell commands follows directly from their 
foundation in the algebra of rational operations, and allows us to build new 
application-specific recognizers automatically. 

The size of composed automata and the efficiency of composition have 
been the main issues in developing the implementation. As explained earlier, 
our main applications involve finding the highest-probability path in
composed automata. It is in general not practical to compute the whole
composition and then find the highest-probability path, because in the worst case
the number of transitions in a composition grows with the product of the 
numbers of transitions in the composed automata. Instead, we have devel- 
oped a lazy implementation of composition, in which the states and arcs of 
the composed automaton are created by pairing states and arcs in the com- 
position arguments only as they are required by some other operation, such 
as search, on the composed automaton [18]. The use of an abstract datatype 
for automata facilitates this, since functions operating on automata do not 
need to distinguish between concrete and lazy automata. 
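
The idea of lazy composition behind a common automaton interface can be
sketched as follows. This is not the C implementation described above: the
class names `Fst` and `LazyCompose`, the method names, and the toy machines
are all assumptions made for illustration. The sketch handles only ε-free
transducers, uses negative log probabilities as weights (so weights add and
the best path has minimum total weight), and omits the composition filter.

```python
import heapq
from itertools import count


class Fst:
    """Concrete epsilon-free transducer with additive (negative-log) weights."""

    def __init__(self, start, finals, arcs):
        self.start = start
        self.finals = finals      # state -> final weight
        self._arcs = arcs         # state -> [(in, out, weight, dest)]

    def final(self, state):
        return self.finals.get(state)

    def arcs(self, state):
        return self._arcs.get(state, [])


class LazyCompose:
    """Composition whose state pairs are expanded only when asked for."""

    def __init__(self, a, b):
        self.a, self.b = a, b
        self.start = (a.start, b.start)
        self._cache = {}          # holds only the state pairs expanded so far

    def final(self, state):
        fa, fb = self.a.final(state[0]), self.b.final(state[1])
        return None if fa is None or fb is None else fa + fb

    def arcs(self, state):
        if state not in self._cache:
            qa, qb = state
            self._cache[state] = [
                (x, z, k1 + k2, (ma, mb))
                for (x, y1, k1, ma) in self.a.arcs(qa)
                for (y2, z, k2, mb) in self.b.arcs(qb)
                if y1 == y2       # output of A must match input of B
            ]
        return self._cache[state]

    def expanded(self):
        return len(self._cache)


_GOAL = object()


def best_path_weight(machine):
    """Uniform-cost search; it only uses start, final() and arcs()."""
    tie = count()                 # tie-breaker so states are never compared
    queue = [(0.0, next(tie), machine.start)]
    done = set()
    while queue:
        cost, _, state = heapq.heappop(queue)
        if state is _GOAL:
            return cost
        if state in done:
            continue
        done.add(state)
        final = machine.final(state)
        if final is not None:
            heapq.heappush(queue, (cost + final, next(tie), _GOAL))
        for _, _, weight, dest in machine.arcs(state):
            if dest not in done:
                heapq.heappush(queue, (cost + weight, next(tie), dest))
    return None


A = Fst(0, {3: 0.0}, {0: [("a", "x", 1.0, 1), ("a", "y", 5.0, 2)],
                      1: [("b", "x", 1.0, 3)],
                      2: [("b", "y", 1.0, 3)]})
B = Fst(0, {3: 0.0}, {0: [("x", "X", 0.5, 1), ("y", "Y", 0.5, 2)],
                      1: [("x", "X", 0.5, 3)],
                      2: [("y", "Y", 0.5, 3)]})
lazy = LazyCompose(A, B)
best = best_path_weight(lazy)
print(best, lazy.expanded())
```

In this toy case the search reaches the goal after expanding three of the
four reachable state pairs; the pair (2, 2) on the costlier branch is never
built. The search function is written against the three-member interface
only, so it accepts a concrete `Fst` or a `LazyCompose` without change.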

The efficiency of composition depends crucially on the efficiency with 
which transitions leaving the two components of a state pair are matched to 
yield transitions in the composed automaton. This task is analogous to doing 
a relational join, and some of the sorting and indexing techniques used for 
joins are relevant here, especially for very large alphabets such as the words 
in large-vocabulary recognition. The interface of the automaton datatype 
has been carefully designed to allow for efficient transition matching while 
hiding the details of transition indexing and sorting. 
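
The analogy with a relational join can be made concrete with a sort-merge
join over the arcs leaving the two components of a state pair. The sketch
below is an assumption-laden illustration, not the paper's interface: the
function name `match_arcs` and the arc tuple layout are hypothetical, and
weights are taken to be negative log probabilities that add under
composition.

```python
def match_arcs(arcs_a, arcs_b):
    """Merge-join two arc lists leaving a state pair.

    arcs_a must already be sorted by output label (index 1) and arcs_b by
    input label (index 0). Arcs are (in_label, out_label, weight, dest).
    """
    matched = []
    i = j = 0
    while i < len(arcs_a) and j < len(arcs_b):
        label_a, label_b = arcs_a[i][1], arcs_b[j][0]
        if label_a < label_b:
            i += 1
        elif label_a > label_b:
            j += 1
        else:
            # Find the run of arcs carrying this label on each side,
            # then pair every arc of one run with every arc of the other.
            i_end = i
            while i_end < len(arcs_a) and arcs_a[i_end][1] == label_a:
                i_end += 1
            j_end = j
            while j_end < len(arcs_b) and arcs_b[j_end][0] == label_a:
                j_end += 1
            for x, _, k1, dest_a in arcs_a[i:i_end]:
                for _, z, k2, dest_b in arcs_b[j:j_end]:
                    matched.append((x, z, k1 + k2, (dest_a, dest_b)))
            i, j = i_end, j_end
    return matched


leaving_q = sorted(
    [("a", "x", 1.0, 1), ("b", "y", 2.0, 2), ("c", "x", 0.5, 3)],
    key=lambda arc: arc[1],
)
leaving_q_prime = sorted(
    [("x", "X", 0.25, 7), ("z", "Z", 1.0, 8)],
    key=lambda arc: arc[0],
)
composed = match_arcs(leaving_q, leaving_q_prime)
print(composed)
```

If arcs are kept sorted once per state, each match costs time linear in the
two arc lists plus the number of matches, rather than the product of the
list lengths that a nested loop would need; a hash index on labels is an
alternative with similar benefits.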



5 Applications 

We have used our implementation in a variety of speech recognition and 
language processing tasks, including continuous speech recognition in the 
60,000-word ARPA North American Business News (NAB) task and 
the 2,000-word ARPA ATIS task, isolated word recognition for directory 
lookup tasks, and segmentation of Chinese text into words fl2~H| . 

The NAB task is by far the largest one we have attempted so far. In our
1994 experiments [17], we used a 60,000-word vocabulary and several very
large automata, including a phone-to-syllable transducer with $5 \times 10^5$
transitions, a syllable-to-word (dictionary) transducer with $10^5$
transitions, and a language model (5-gram) with $3.4 \times 10^7$
transitions. We are at present
experimenting with various improvements in modeling and in the implementation
of composition, especially in the filter, that would allow us to use directly 
the lazy composition of the whole decoding cascade for this application in a 
standard time-synchronous Viterbi decoder. In our 1994 experiments, how- 
ever, we had to break the cascade into a succession of stages, each generating 
a pruned lattice (an acyclic acceptor) through a combination of lazy compo- 
sition and graph search. In addition, relatively simple models are used first 
(context-independent phone models, bigram language model) to produce a
relatively small pruned word lattice, which is then intersected with the com- 
position of the full models to create a rescored lattice which is then searched 
for the best path. That is, we use an approximate word lattice to limit the 
size of the composition with the full language and phonemic models. This 
multi-pass decoder achieved around 10% word-error rate in the main 1994 
NAB test, while requiring around 500 times real-time for recognition. 

In our more recent experiments with lazy composition in synchronous
Viterbi decoders, we have been able to show that lazy composition is as fast
as or faster than traditional methods requiring full expansion of the composed
automaton in advance, while requiring a small fraction of the space. The
ARPA ATIS task, for example, uses a context transducer with 40,386
transitions, a dictionary with 4,816 transitions, and a class-based
variable-length n-gram language model [16] with 359,532 transitions. The
composition of these three automata would have around $6 \times 10^6$
transitions. However, for a typical sentence only around 5% of those
transitions are actually visited.



6 Further Work 

We have been investigating a variety of improvements, extensions and ap- 
plications of the present work. With Emerald Chung, we have been refining 
the connection between a time-synchronous Viterbi decoder and lazy com- 
position to improve time and space efficiency. With Mehryar Mohri, we 
have been developing improved composition filters, as well as exploring on- 
the-fly and local determinization techniques for transducers and weighted 
automata [^] to decrease the impact of nondeterminism on the size (and 
thus the time required to create) of composed automata. Our work on the
implementation has also been influenced by applications to the compilation of
weighted phonological and morphological rules and by ongoing research on 
integrating speech recognition with natural-language analysis and
translation. Finally, we are investigating applications to local grammatical
analysis, in which transducers have often been used, but not with weights.

A Correctness of ε-Free Composition

As shown in Section 2.4 (10), we have

$$(L_A(q) \circ L_B(q'))(r,t) = \sum_{s \in \Gamma^*} \; \sum_{p \in P_A^{(r,s)}(q)} \; \sum_{p' \in P_B^{(s,t)}(q')} \hat{W}(p) \times \hat{W}(p') \tag{13}$$

Clearly, for ε-free transducers the variables r, s, t, p and p' in this equation
satisfy the constraint $|r| = |s| = |t| = |p| = |p'| = n$ for some n. This
allows us to show the correctness of the composition construction for ε-free
automata by induction on n. Specifically, we shall show that for any $q \in Q_A$
and $q' \in Q_B$

$$L_{A \bowtie B}(q,q') = L_A(q) \circ L_B(q') \tag{14}$$

For n = 0, from (13) and the composition construction we obtain

$$\begin{aligned}
(L_A(q) \circ L_B(q'))(\epsilon,\epsilon) &= F_A(q) \times F_B(q') \\
&= F_{A \bowtie B}(q,q') \\
&= L_{A \bowtie B}(q,q')(\epsilon,\epsilon)
\end{aligned}$$

as needed. 

Assume now that $L_{A \bowtie B}(m,m')(u,w) = (L_A(m) \circ L_B(m'))(u,w)$ for any
$m \in Q_A$, $m' \in Q_B$, $u \in \Sigma^*$ and $w \in \Delta^*$ with $|u| = |w| < n$.
Let $r = xu$ and $t = zw$, with $x \in \Sigma$ and $z \in \Delta$. Then by (13) and
the composition construction we have

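$$\begin{aligned}
(L_A(q) \circ L_B(q'))(xu, zw)
&= \sum_{y \in \Gamma} \sum_{v \in \Gamma^*} \sum_{p \in P_A^{(xu,yv)}(q)} \sum_{p' \in P_B^{(yv,zw)}(q')} \hat{W}(p) \times \hat{W}(p') \\
&= \sum_{y \in \Gamma} \; \sum_{(q,(x,y),k,m) \in \delta_A} \; \sum_{(q',(y,z),k',m') \in \delta_B} k \times k' \times \Bigl( \sum_{v \in \Gamma^*} \sum_{l \in P_A^{(u,v)}(m)} \sum_{l' \in P_B^{(v,w)}(m')} \hat{W}(l) \times \hat{W}(l') \Bigr) \\
&= \sum_{((q,q'),(x,z),j,(m,m')) \in \delta_{A \bowtie B}} j \times \Bigl( \sum_{v \in \Gamma^*} \sum_{l \in P_A^{(u,v)}(m)} \sum_{l' \in P_B^{(v,w)}(m')} \hat{W}(l) \times \hat{W}(l') \Bigr) \\
&= \sum_{((q,q'),(x,z),j,(m,m')) \in \delta_{A \bowtie B}} j \times (L_A(m) \circ L_B(m'))(u,w) \\
&= \sum_{((q,q'),(x,z),j,(m,m')) \in \delta_{A \bowtie B}} j \times L_{A \bowtie B}(m,m')(u,w) \\
&= \sum_{((q,q'),(x,z),j,(m,m')) \in \delta_{A \bowtie B}} j \times \Bigl( \sum_{g \in P_{A \bowtie B}^{(u,w)}(m,m')} \hat{W}_{A \bowtie B}(g) \Bigr) \\
&= \sum_{h \in P_{A \bowtie B}^{(xu,zw)}(q,q')} \hat{W}_{A \bowtie B}(h) \\
&= L_{A \bowtie B}(q,q')(xu,zw)
\end{aligned}$$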

This shows (14) for ε-free transducers, and as a particular case

$$[\![A \bowtie B]\!] = [\![A]\!] \circ [\![B]\!],$$

which states that transducer composition correctly implements transduction 
composition. 



B General Composition Construction 

For any transition t in A or B, we define

$$\mathrm{Mark}_i(t) = \begin{cases} \tau_i & \text{if } \pi_i(t.\mathrm{lab}) = \epsilon \\ \pi_i(t.\mathrm{lab}) & \text{otherwise} \end{cases}$$

where each $\tau_i$ is a new symbol not in $\Gamma$. This can be extended to a path
$p = t_1, \ldots, t_m$ in the obvious way by
$\mathrm{Mark}_i(p) = \mathrm{Mark}_i(t_1) \cdots \mathrm{Mark}_i(t_m)$. If
p and p' satisfy (12), there will be $m, n \geq k$ such that $p = t_1, \ldots, t_m$,
$p' = t'_1, \ldots, t'_n$, $v = y_1 \cdots y_k$ and
$v = p.\mathrm{lab.out} = p'.\mathrm{lab.in}$. Therefore, we will have
$\mathrm{Mark}_2(p) = u_0 y_1 u_1 \cdots u_{k-1} y_k u_k$ where $u_i \in \{\tau_2\}^*$
and $|u_0 \cdots u_k| = m - k$, and
$\mathrm{Mark}_1(p') = v_0 y_1 v_1 \cdots v_{k-1} y_k v_k$ where $v_i \in \{\tau_1\}^*$
and $|v_0 \cdots v_k| = n - k$.

We will need the following standard definition of the shuffle $L \star L'$ of two
languages $L$ and $L'$ over a common alphabet:

$$L \star L' = \{u_1 v_1 \cdots u_l v_l \mid u_1 \cdots u_l \in L,\ v_1 \cdots v_l \in L'\}$$



Then it is easy to see that (12) holds iff

$$J = (\{\mathrm{Mark}_2(p)\} \star \{\tau_1\}^*) \cap (\{\mathrm{Mark}_1(p')\} \star \{\tau_2\}^*) \neq \emptyset. \tag{15}$$

Each composition string $v \in J$ has the form

$$v = v_0 y_1 v_1 \cdots v_{k-1} y_k v_k \tag{16}$$

for $y_i \in \Gamma$ and $v_i \in \{\tau_1, \tau_2\}^*$. Furthermore, by construction, any string
$v'_0 y_1 v'_1 \cdots v'_{k-1} y_k v'_k$, where each $v'_i$ is derived from $v_i$ by
commuting $\tau_1$ instances with $\tau_2$ instances, is also in J.

Consider for example the transducers A shown in Figure 2a and B shown
in Figure 2b. For path p from state 0 to state 4 in A and path p' from state
0 to state 3 in B we have the following equalities:

$$\begin{aligned}
\mathrm{Mark}_2(p) &= a \tau_2 \tau_2 d \\
\mathrm{Mark}_1(p') &= a \tau_1 d \\
J &= \{ a \tau_1 \tau_2 \tau_2 d,\ a \tau_2 \tau_1 \tau_2 d,\ a \tau_2 \tau_2 \tau_1 d \}
\end{aligned}$$

Therefore, p and p' satisfy (12), allowing $[\![A]\!] \circ [\![B]\!]$ to map abcd to dea.
It is also straightforward to see that, given the transducers A' in Figure 2c
and B' in Figure 2d, we have

$$\begin{aligned}
\{\mathrm{Mark}_2(p)\} \star \{\tau_1\}^* &= \{p.\mathrm{lab.out} \mid p \in P_{A'}(0)\} \\
\{\mathrm{Mark}_1(p')\} \star \{\tau_2\}^* &= \{p'.\mathrm{lab.in} \mid p' \in P_{B'}(0)\}
\end{aligned}$$

Since there are no ε labels on the output side of A' or the input side of B',
we can apply to them the ε-free composition construction, with the result
shown in Figure 3. Each of the paths from the initial state to the final
state corresponds to a different composition string in
$(\{\mathrm{Mark}_2(p)\} \star \{\tau_1\}^*) \cap (\{\mathrm{Mark}_1(p')\} \star \{\tau_2\}^*)$.

The transducer $A' \bowtie B'$ pairs up exactly the strings it should, but it
does not correctly implement $[\![A]\!] \circ [\![B]\!]$ in the general weighted case. The
construction described so far allows several paths in $A' \bowtie B'$ corresponding
to each pair of paths from A and B. Intuitively, this is possible because $\tau_1$
and $\tau_2$ are allowed to commute freely in the composition string. But if one
pair of paths p from A and p' from B leads to several paths in $A' \bowtie B'$,
the weights from the ε-transitions in A and B will appear multiple times in
the overall weight for going from (p.src, p'.src) to (p.dst, p'.dst) in $A' \bowtie B'$.
If the semiring sum operation is not idempotent, that leads to the wrong
weights in (0).

To achieve the correct path multiplicity, we interpose a transducer Filter
between A' and B' in a 3-way composition (A', Filter, B'). The Filter
transducer is shown in Figure ||, where the transition labeled x : x represents 
a set of transitions mapping x to itself for each $x \in \Gamma$. The effect of Filter
is to block any paths in $A' \bowtie B'$ corresponding to a composition string
containing the substring $\tau_2 \tau_1$. This eliminates all the composition strings
(16) in (15) except for the one with $v_i \in \{\tau_1\}^*\{\tau_2\}^*$, which is guaranteed
to exist since J in (15) allows all interleavings of $\tau_1$ and $\tau_2$, including the
required one in which all $\tau_2$ instances follow all $\tau_1$ instances. For
example, Filter would remove all but the thick-line path in Figure 3, as
needed to avoid incorrect path multiplicities.
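
The running example can be checked mechanically. The Python sketch below
enumerates the composition strings J for Mark_2(p) = a τ_2 τ_2 d and
Mark_1(p') = a τ_1 d, applies the filter condition (no substring τ_2 τ_1),
and then shows the effect on weights in a non-idempotent semiring. The
function names, the ASCII stand-ins `tau1` and `tau2`, and the two path
weights are assumptions made for the example; the Filter transducer itself
is replaced here by an equivalent test on strings.

```python
T1, T2 = "tau1", "tau2"   # stand-ins for the marker symbols


def shuffle(u, v):
    """All interleavings of the symbol tuples u and v, as a set of tuples."""
    if not u:
        return {tuple(v)}
    if not v:
        return {tuple(u)}
    return ({(u[0],) + w for w in shuffle(u[1:], v)}
            | {(v[0],) + w for w in shuffle(u, v[1:])})


def shuffle_with_star(marked, filler, max_fill):
    """{marked} shuffled with filler^n for every n from 0 to max_fill."""
    result = set()
    for n in range(max_fill + 1):
        result |= shuffle(marked, (filler,) * n)
    return result


def passes_filter(string):
    """True unless the string contains the substring tau2 tau1."""
    return all(pair != (T2, T1) for pair in zip(string, string[1:]))


mark2_p = ("a", T2, T2, "d")       # output side of path p in A
mark1_p_prime = ("a", T1, "d")     # input side of path p' in B

# Bounding the Kleene star by the other string's length loses nothing:
# a string in the intersection has exactly as many tau1 symbols as
# Mark_1(p') and exactly as many tau2 symbols as Mark_2(p).
J = (shuffle_with_star(mark2_p, T1, len(mark1_p_prime))
     & shuffle_with_star(mark1_p_prime, T2, len(mark2_p)))
kept = {s for s in J if passes_filter(s)}

# Hypothetical path weights in the probability semiring (sum is +).
w_p, w_p_prime = 0.5, 0.25
correct = w_p * w_p_prime
unfiltered = sum(w_p * w_p_prime for _ in J)     # one term per duplicate
filtered = sum(w_p * w_p_prime for _ in kept)
print(len(J), len(kept), unfiltered, filtered)
```

The three strings in J match the three listed for the example, and only
a τ_1 τ_2 τ_2 d survives the filter. With ordinary addition as the semiring
sum, the unfiltered total is three times the correct weight, whereas with an
idempotent sum such as min the duplicates would be harmless.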